Feature Engineering - create new features as described below.
I have selected an 8-cluster model, and will then explain the findings.
Some guiding questions:
Per my understanding, it is unwise to claim that a doctor who charges high fees is a fraud. Several other factors need to be considered before declaring a doctor a fraud. The most basic one is that any comparison which says a doctor is charging high needs to be on the same scale; see below for the scale metrics.
A fair comparison means we are comparing apples to apples, and the same type of apples at that. The comparison needs to be in terms of the DRG or category of treatment, the location of the doctor or the hospital, and whether it is a general public hospital or a premium hospital whose base admission charges are very high no matter the treatment.
There are two types of anomalies. The first is the one-off procedure that is extremely costly compared to the median cost charged by other hospitals in the same state. These instances were found by scoring each treatment, where the score indicates how many times the cost of the treatment exceeds that of others in the state.
The second type of anomaly is hospitals that overcharge for treatments to a lesser extent, but over the course of many treatments. These were found by averaging the above-mentioned score over all the treatments billed by the hospital; this averaged score became the hospital's score. Those with scores above 2 are flagged as hospitals that consistently overcharge. A check was made to ensure that these hospitals were not simply located in expensive cities, as such cities tend to charge only an extra 20% to 70%.
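The two-level scoring scheme described above can be sketched on toy data (the state/hospital/cost values below are invented purely for illustration):

```python
import pandas as pd

# Toy data, invented purely for illustration: treatment costs by hospital and state.
toy = pd.DataFrame({
    'state':    ['MA', 'MA', 'MA', 'MA', 'MA', 'MA', 'NY', 'NY'],
    'hospital': ['A',  'A',  'B',  'B',  'C',  'C',  'D',  'D'],
    'cost':     [100., 110., 105., 115., 600., 650., 200., 210.],
})

# Score each treatment against the median cost in that treatment's state.
toy['score'] = toy['cost'] / toy.groupby('state')['cost'].transform('median')

# Type 1: one-off anomalies -- individual treatments far above the state median.
one_off = toy[toy['score'] > 2]

# Type 2: systematic anomalies -- hospitals whose *average* score exceeds 2.
hospital_score = toy.groupby('hospital')['score'].mean()
flagged = hospital_score[hospital_score > 2]
```

In this toy example hospital C is caught on both counts; the 2x threshold sits well above the 20% to 70% premium that an expensive city alone would explain.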
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import scipy
import time
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings("ignore")
import missingno as msno
from sklearn.impute import SimpleImputer
from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from pprint import pprint
import zipcodes
import plotly
import plotly.express as px
# Read the data
df = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 5 Clustering/inpatientCharges.csv')
df.head()
# Rows and Columns
print(" ")
print("Number of rows and columns in the dataset:")
df.shape
print(" ")
print("Basic statistics of the numerical columns are as follows:")
# Check basic statistics
df.describe()
# Check the column dtypes
df.info()
Check for categorical/object fields from the data variable descriptions. Convert the relevant numeric fields to their respective categorical/object fields:
Provider Id
The Zipcode column is represented as an integer. Convert it to zipcode format.
The variable Hospital Referral Region Description comprises the State and the city, which I see is the nearest metro city.
Average Covered Charges is not significant for our analysis; it would be for other purposes such as claims fraud, insurance premiums, etc.
The two payments columns need to be converted to proper numeric formats.
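A sketch of one way to do this conversion, assuming the raw payment values carry a leading dollar sign and possibly thousands separators (the sample strings below are invented):

```python
import pandas as pd

# Hypothetical sample strings mimicking the raw payment format (assumed "$1,234.56").
raw = pd.Series(['$5,777.24', '$4,763.73', '$30,552.04'])

# Drop the dollar sign and thousands separators, then cast to float.
clean = raw.str.replace(r'[$,]', '', regex=True).astype(float)
```

Unlike slicing off only the first character, this also survives values that contain thousands separators.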
# Basic Sort of Provider ID and DRG Definition
df = df.sort_values(['Provider Id', 'DRG Definition'])
df.head(2)
The given column names have a lot of spaces, trailing spaces, etc. I will rename all the columns as per an appropriate naming convention.
df.columns = ['DRG_Definition', 'Provider_Id', 'Provider_Name',
'Provider_Street_Address', 'Provider_City', 'Provider_State',
'Provider_Zip_Code', 'Hospital_Referral_Region_Description',
'Total_Discharges', 'Average_Covered_Charges',
'Average_Total_Payments', 'Average_Medicare_Payments']
df.columns
As discussed above, Average_Covered_Charges is irrelevant for the fraud detection purpose, so I will drop that column altogether.
# df = df.drop(columns = ['Average_Covered_Charges'])
df["Average_Total_Payments"] = df["Average_Total_Payments"].str[1:].astype(float)
df["Average_Medicare_Payments"] = df["Average_Medicare_Payments"].str[1:].astype(float)
df["Provider_Id"] = df["Provider_Id"].astype(object)
df.head(2)
Provider_Zip_Code from integer to appropriate Zipcode format
# Zipcode to 5-character zipcode format
df['Provider_Zip_Code'] = df['Provider_Zip_Code'].astype(str).str.zfill(5)
df.isnull().sum().sum()
There are no NA's, which is good for our analysis.
Provider_State variable.
Regions will be a very useful feature when performing the Exploratory Data Analysis.
states = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 4 EDA/states.csv')
states.head(5)
# Left join the new dataset
df = pd.merge(left = df, right = states, left_on = 'Provider_State', right_on = 'State Code', how = 'left')
df.head(2)
# Remove duplicate 'state' column
df = df.drop(columns = ['State', 'State Code'])
Dataset source: https://www.psc.isr.umich.edu/dis/census/Features/tract2zip/
This has zipcode-wise mean and median income data for 2006 to 2010.
income_df = pd.read_excel('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 5 Clustering/MedianZIP-3.xlsx')
income_df['Zip'] = income_df['Zip'].astype(str).str.zfill(5)
income_df.head(3)
income_df.isnull().sum()
df = pd.merge(left = df, right = income_df, left_on = 'Provider_Zip_Code', right_on = 'Zip', how = 'left')
df = df.drop(columns = ['Zip','Mean'])
df.rename(columns={'Median':'Zip_Median_Income',
'Pop':'Zip_Population'}, inplace=True)
df.head(2)
DRG_Definition Distribution
Explore the total number of DRG Definitions, and the count of how many times they appear.
df_count = df['DRG_Definition'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['DRG_Definition','Count']
df_count.head()
fig = px.bar(df_count, x = 'DRG_Definition', y = 'Count', color = 'DRG_Definition',
width=1450, height=500,
title = "Distribution of DRG Definitions")
fig.show()
The DRG Definitions have a seemingly good distribution. The counts of DRG Definitions range from around 600 to 3000, and all DRG Definition counts lie within this range.
Here, no DRG Definition count looks like an outlier; all behave normally.
Provider_Name Distribution, see popular Providers
Explore the total number of Provider Names and the count of how many times each one appears.
df_count = df['Provider_Name'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_Name','Count']
df_count.head()
# Show only those Provider_Names whose total count is above 100
df_count1 = df_count.loc[df_count['Count'] > 100]
fig = px.bar(df_count1, x='Provider_Name', y='Count',
width=1200, height=500,
color = 'Provider_Name',
title = "Distribution of Provider Names, only showing for Count > 100")
fig.show()
# Show only those Provider_Names whose total count is below 3
df_count1 = df_count.loc[df_count['Count'] < 3]
fig = px.bar(df_count1, x='Provider_Name', y='Count',
width=1200, height=500,
color = 'Provider_Name',
title = "Distribution of Provider Names, only showing for Count < 3")
# fig.show()
From the above two count charts, it is clear that some Providers are extremely popular and have around 600 entries. They seem to be the ones who provide services under multiple DRG Definitions.
Meanwhile, some Providers are very unpopular and have only 1 entry. This depends on the DRG Definition, as some hospitals may be single-specialty hospitals, and hence everyone goes there only for that specialty.
Provider_City Distribution, see popular Cities
Explore the total number of Cities and the count of how many times each one appears.
df_count = df['Provider_City'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_City','Count']
df_count.head()
# Show only those Provider_Cities whose total count is above 500
df_count1 = df_count.loc[df_count['Count'] > 500]
fig = px.bar(df_count1, x='Provider_City', y='Count',width=1000, height=500,
color = 'Provider_City',
title = "Distribution of Provider Cities, only showing for Count >500")
# fig.show()
# Show only those Provider_Cities whose total count is below 5
df_count1 = df_count.loc[df_count['Count'] < 5]
fig = px.bar(df_count1, x='Provider_City', y='Count',width=1000, height=500,
color = 'Provider_City',
title = "Distribution of Provider Cities, only showing for Count < 5")
# fig.show()
Chicago is the most popular city with around 1500 entries. There are also a lot of other cities which have only 1 entry.
Provider_State Distribution, see popular States
Explore the total number of States and the count of how many times each one appears.
df_count = df['Provider_State'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_State','Count']
df_count.head()
fig = px.bar(df_count, x='Provider_State', y='Count',
width=1000, height=500,
color = 'Provider_State',
title = "Distribution of Provider State")
# fig.show()
The states seem to have a good distribution. There seem to be no outliers or states requiring special attention.
Average_Total_Payments Distribution
fig = px.histogram(df, x="Average_Total_Payments",
width=1000, height=500,
title = "Distribution of Average Total Payments")
# fig.show()
fig = px.box(df, x="Average_Total_Payments",width=1000, height=500,
title = "Distribution of Average Total Payments")
fig.show()
fig = px.violin(df, y="Average_Total_Payments", box=True,
points='all',width=800, height=500,
title = "Distribution of Average Total Payments")
# fig.show()
Most of the Average Total Payments are less than USD 11,000, so any average payment above that might be a reason for further investigation.
There are some extreme high values, more than USD 150,000, which may need investigation.
Average_Medicare_Payments Distribution
fig = px.histogram(df, x="Average_Medicare_Payments",
width=1000, height=500,
title = "Distribution of Average Medicare Payments")
# fig.show()
fig = px.box(df, x="Average_Medicare_Payments",width=1000, height=500,
title = "Distribution of Average Medicare Payments")
# fig.show()
Most of the Average Medicare Payments are less than USD 10,000, so any average payment above that might be a reason for further investigation.
There are some extreme high values, more than USD 150,000, which may need investigation.
Total_Discharges Distribution
fig = px.histogram(df, x="Total_Discharges",
width=800, height=500,
title = "Distribution of Total Discharges")
# fig.show()
fig = px.box(df, x="Total_Discharges",width=1000, height=500,
title = "Distribution of Total Discharges")
# fig.show()
Most of the Total Discharges are less than 49.
There are some extreme high values, such as 3383, which may need investigation.
Average_Total_Payments by DRG_Definition
fig = px.box(df, x="DRG_Definition", y="Average_Total_Payments", width=1400, height=500,
color = "DRG_Definition",
title = "The Distribution of Average Total Payments by DRG Definition")
fig.show()
Some DRGs have very high Average Total Payments; these may be critical operations, which bear high costs.
Average_Total_Payments by State
fig = px.box(df, x="Provider_State", y="Average_Total_Payments", width=1000, height=500,
color = "Provider_State",
title = "The Distribution of Average Total Payments by Provider State")
fig.show()
The Average Total Payments are more or less similar across states, but some states such as NY and CA are very expensive overall.
Average_Total_Payments by Region
fig = px.box(df, x="Region", y="Average_Total_Payments", width=1000, height=500,
color = "Region",
title = "The Distribution of Average Total Payments by Region")
# fig.show()
# px.violin(df,x='Average_Total_Payments', y = "Region", color='Region',
# title = "The Distribution of Average Total Payments by Region",
# orientation='h').update_traces(side='positive',width=2)
The West region seems to be generally high in terms of Average Total Payments. This was verified earlier when we saw that the state CA was extremely high as well.
It is followed by the Northeast, which includes the state NY.
Total_Discharges by DRG_Definition
fig = px.box(df, x="DRG_Definition", y="Total_Discharges", width=1400, height=500,
color = "DRG_Definition",
title = "The Distribution of Total Discharges by DRG Definition")
# fig.show()
The discharge rate for some DRGs is very high, while most others have a balanced discharge rate.
Total_Discharges by Region
fig = px.box(df, x="Region", y="Total_Discharges", width=1000, height=500,
color = "Region",
title = "The Distribution of Total Discharges by Region")
# fig.show()
# px.violin(df,x='Total_Discharges', y = "Region", color='Region',
# title = "The Distribution of Total Discharges by Region",
# orientation='h').update_traces(side='positive',width=2)
Most regions have a similar total discharge pattern. However, the Northeast region has an outlier.
Average_Total_Payments and Average_Medicare_Payments
fig = px.scatter(df, x="Average_Total_Payments", y="Average_Medicare_Payments",
size = "Average_Total_Payments", color = 'Average_Total_Payments',
size_max=60,width=800, height=600)
# fig.show()
As the average total payments increase, the average medicare payments also increase, which shows that there is very high collinearity between these two variables.
agg_columns = ['mean', 'median', 'var', 'std', 'count', 'min', 'max']
groupby_drg = df[['DRG_Definition', 'Average_Total_Payments']].groupby(by='DRG_Definition').agg(agg_columns)
groupby_drg.columns = [header + '-' + agg_column
                       for header, agg_column in zip(groupby_drg.columns.get_level_values(0), agg_columns)]
groupby_drg.reset_index(inplace=True)
groupby_drg['Average_Total_Payments-range'] = groupby_drg['Average_Total_Payments-max'] - groupby_drg['Average_Total_Payments-min']
groupby_drg.head(2)
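The column flattening above can also be expressed with pandas named aggregation, which produces flat column names in one step; a minimal sketch on toy data (column names shortened, values invented):

```python
import pandas as pd

# Toy stand-in for the DRG / payments columns (values invented).
toy = pd.DataFrame({
    'DRG_Definition': ['A', 'A', 'B'],
    'Average_Total_Payments': [100.0, 300.0, 50.0],
})

# Named aggregation yields flat, readable column names directly.
g = toy.groupby('DRG_Definition').agg(
    pyt_mean=('Average_Total_Payments', 'mean'),
    pyt_min=('Average_Total_Payments', 'min'),
    pyt_max=('Average_Total_Payments', 'max'),
).reset_index()
g['pyt_range'] = g['pyt_max'] - g['pyt_min']
```

This avoids the manual zip over `get_level_values(0)` while computing the same statistics.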
def plt_setup(_plt):
    _plt.tick_params(
        axis='x',           # changes apply to the x-axis
        which='both',       # both major and minor ticks are affected
        bottom=False,       # ticks along the bottom edge are off
        top=False,          # ticks along the top edge are off
        labelbottom=False)  # labels along the bottom edge are off
plt.figure(figsize=(20,8))
sns.barplot(x='DRG_Definition', y='Average_Total_Payments-mean',
data=groupby_drg.sort_values('Average_Total_Payments-mean'))
plt_setup(plt)
plt.title('Mean Average Total Payments by DRG', fontsize=16)
plt.ylabel('Mean of Average Total Payments', fontsize=16)
Some DRG groups have very high mean, which implies that there are some DRG groups which generally charge a very high amount for treatment in terms of 'Total Payments'.
pyt_by_drg = df.groupby('DRG_Definition').sum(numeric_only=True).reset_index()
pyt_by_drg = pyt_by_drg.sort_values('Average_Total_Payments', ascending=False)
pyt_by_drg.head()
# Extract only rows with amount > 40,000,000
pyt_by_drg = pyt_by_drg.loc[pyt_by_drg['Average_Total_Payments'] > 40000000]
# plt.figure(figsize=(20,4))
# fig = sns.barplot(x='DRG_Definition', y='Average_Total_Payments',
# data=pyt_by_drg)
# fig.set_xticklabels(fig.get_xticklabels(), rotation=15)
# plt.title('Mean Average Total Payments by DRG, for total > 40,000,000', fontsize=16)
# plt.ylabel('Mean of Average Total Payments', fontsize=16)
The DRG 329 is the highest in terms of the total sum of the Average Total Payments.
unique_ids = len(df.groupby('Provider_Id').count())
unique_providers = len(df.groupby('Provider_Name').count())
unique_cities = len(df.groupby('Provider_City').count())
unique_states = len(df.groupby('Provider_State').count())
print(" ")
print(f'There are {unique_ids} unique provider id values in the data, and {unique_providers} unique provider names in a total of {unique_cities} unique cities, and {unique_states} states.')
print(" ")
fig = sns.pairplot(df[['Region', 'Total_Discharges', 'Average_Total_Payments','Average_Medicare_Payments']],
hue= 'Region')
fig
corr = df[['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments']].corr()
f,ax = plt.subplots(figsize=(7,5))
sns.heatmap(corr, annot=True, cmap='Reds', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()
From the above graphs, some variables are highly correlated, such as Average Total Payments and Average Medicare Payments. Average Total Payments has a long-tailed distribution, which could indicate potential fraud.
From the correlation matrix: total payments are correlated with medicare payments.
We can conclude that those variables are indeed related; for modeling purposes, it makes more sense to include only one or two of the three variables.
plt.figure(figsize=(20,20))
g = sns.PairGrid(df,
x_vars = ['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments'],
y_vars = ['Provider_State'],
height=10, aspect=.25)
# Draw plot
g.map(sns.stripplot, size=10, orient="h",
palette="ch:s=1,r=-.1,h=1_r", linewidth=1.10, edgecolor="w")
# Use the same x axis limits on all columns and add better labels
g.set(xlabel="", ylabel="")
titles = ["Total Discharges", "Average Total Payments",
          "Average Medicare Payments"]
for ax, title in zip(g.axes.flat, titles):
# Set a different title for each axes
ax.set(title = title)
# Make the grid horizontal instead of vertical
ax.xaxis.grid(False)
ax.yaxis.grid(True)
# plt.tight_layout()
# plt.show()
plt.figure(figsize=(20,20))
g = sns.PairGrid(df,
x_vars = ['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments'],
y_vars = ['Region'],
height=10, aspect=.25)
# Draw plot
g.map(sns.stripplot, size=10, orient="h",
palette="ch:s=1,r=-.1,h=1_r", linewidth=1.10, edgecolor="w")
# Use the same x axis limits on all columns and add better labels
g.set(xlabel="", ylabel="")
titles = ["Total Discharges", "Average Total Payments",
          "Average Medicare Payments"]
for ax, title in zip(g.axes.flat, titles):
# Set a different title for each axes
ax.set(title = title)
# Make the grid horizontal instead of vertical
ax.xaxis.grid(False)
ax.yaxis.grid(True)
# plt.tight_layout()
# plt.show()
Feature 1
By understanding the top DRGs per state, we establish a baseline for what people in the state normally get treated for. This information is useful in the following way:
It assumes that fraudulent users would try to get treated for the same conditions as the general population.
drg_by_state = df.groupby(['Provider_State', 'DRG_Definition']).agg({'DRG_Definition': 'count'})
drg_by_state.head()
drg_by_state.tail()
I am not merging this with the original dataset, as this is a new reference table created to check the most common DRGs by state.
Feature 2
This information can be used in one of two ways:
providers_per_city = df.groupby(['Provider_City']).agg({'Provider_Name':'count'})
providers_per_city.head()
fig = plt.figure(figsize=(10,7))
providers_per_city.sort_values(by = 'Provider_Name', ascending = False).head()
sns.distplot(providers_per_city)
plt.tight_layout()
I am not merging this with the original dataset as this is a new data table created to be used as a reference to check the most common hospitals.
df['Average_Cost_Per_Procedure'] = df['Average_Total_Payments']/df['Total_Discharges']
df.head(2)
fig = plt.figure(figsize=(15,7))
plt.boxplot(df['Average_Cost_Per_Procedure'], vert=False)
plt.title('Average Cost Per Procedure')
plt.xlabel("Cost in $")
plt.tight_layout()
df['Medicare_%_Paid'] = df['Average_Medicare_Payments']/df['Average_Total_Payments']
df.head(2)
fig = plt.figure(figsize=(15,7))
sns.boxplot(df['Medicare_%_Paid'])
plt.title('Percentage of Average Total Payments Paid by Medicare (Average)')
plt.tight_layout()
Most average medicare payments are around 80-90% of the average total payments.
medicare_pct_state = df.groupby('Provider_State').agg({'Medicare_%_Paid': 'mean'}).reset_index()
medicare_pct_state.head(5)
df = pd.merge(left = df, right = medicare_pct_state, left_on = 'Provider_State', right_on = 'Provider_State',
how = 'left')
df.rename(columns = {'Medicare_%_Paid_x':'Medicare_%_Paid',
'Medicare_%_Paid_y':'Medicare_%_Paid_State'}, inplace=True)
df.head(2)
fig = px.scatter(df, x="Provider_State", y="Medicare_%_Paid_State", width=1000, height=500,
color = "Provider_State",
title = "Medicare % Paid Distribution (by State)")
fig.show()
fig = plt.figure(figsize=(15,7))
sns.distplot(df['Medicare_%_Paid_State'])
plt.title('Medicare % Paid Distribution (by State)')
plt.tight_layout()
df['Out_of_Pocket_Payment'] = df['Average_Total_Payments'] - df['Average_Medicare_Payments']
df.head(2)
sorted_avg_out_of_pocket = df.groupby(['DRG_Definition']).agg({'Out_of_Pocket_Payment': 'mean'})
sorted_avg_out_of_pocket.sort_values(by = 'Out_of_Pocket_Payment',ascending=False).head()
fig = plt.figure(figsize=(15,7))
sns.distplot(df['Out_of_Pocket_Payment'])
plt.title('Out of Pocket Payment Distribution')
plt.tight_layout()
df['Out_of_Pocket_per_discharge'] = df['Out_of_Pocket_Payment']/df['Total_Discharges']
df.head(2)
fig = plt.figure(figsize=(15,7))
sns.distplot(df['Out_of_Pocket_per_discharge'])
plt.tight_layout()
patients_states = df['Provider_State'].value_counts()
patients_states = pd.DataFrame(patients_states).reset_index()
patients_states.columns = ['Provider_State','Count']
patients_states.head()
df = pd.merge(left = df, right = patients_states, left_on = 'Provider_State', right_on = 'Provider_State',
how = 'left')
df.rename(columns = {'Count':'State_Total'}, inplace=True)
df.head(2)
fig = px.scatter(df, x="Provider_State", y="State_Total", width=1000, height=500,
color = "Provider_State",
title = "Total Procedures/Treatments per state")
fig.show()
fig = plt.figure(figsize=(15,7))
sns.distplot(df['State_Total'])
plt.tight_layout()
patient_avg_state = df.groupby('Provider_State').mean()[['Total_Discharges',
'Average_Total_Payments',
'Average_Medicare_Payments']].reset_index()
patient_avg_state.head()
patient_avg_state.loc[:,'Total_Discharges':'Average_Medicare_Payments'].corr()
fig = plt.figure(figsize=(15,10))
plt.subplot(2, 2, 1)
plt.boxplot(patient_avg_state['Total_Discharges'])
plt.title('Total Disharge Box plot')
plt.xlabel('')
plt.subplot(2, 2, 3)
plt.boxplot(patient_avg_state['Average_Total_Payments'])
plt.title('Average Total Payment Boxplot')
plt.xlabel('')
plt.subplot(2, 2, 4)
plt.boxplot(patient_avg_state['Average_Medicare_Payments'])
plt.title('Average Medicare Payment Boxplot')
plt.tight_layout()
plt.show()
# States with highest discharges
patient_avg_state.sort_values(by = 'Total_Discharges', ascending = False).head()
# States with highest Average Total Payments
patient_avg_state.sort_values(by = 'Average_Total_Payments', ascending = False).head()
These new mean columns are useful, but I believe median columns will be more handy, so I will not add the mean columns to the original dataset yet.
Feature 10
To catch payments for treatments that exceed a normal amount, I will first create a feature that calculates a score for each data row. The score will indicate the size of the payment amount for a particular treatment relative to the median payment amount for the DRG code at the state level.
median_drg_state = df.groupby(['DRG_Definition','Provider_State'])['Average_Total_Payments'].median().reset_index()
median_drg_state.head()
df = pd.merge(left = df, right = median_drg_state, left_on = ['DRG_Definition','Provider_State'], right_on = ['DRG_Definition','Provider_State'], how = 'left')
df.rename(columns={'Average_Total_Payments_y':'Median_Avg_Total_Pyt',
'Average_Total_Payments_x':'Average_Total_Payments'}, inplace=True)
df.head(2)
# Check for one particular state and one particular DRG, to see median
df[(df['Provider_State'] == 'NV') & (df['DRG_Definition'] == '194 - SIMPLE PNEUMONIA & PLEURISY W CC')].head(2)
Feature 11
I will now take the average total payment and divide it by the median payment to generate a simple score that indicates how many times larger the current payment amount is than the median amount.
df['Median_Score'] = df['Average_Total_Payments']/df['Median_Avg_Total_Pyt']
df.head(2)
fig = plt.figure(figsize=(10,5))
sns.distplot(df['Median_Score'])
plt.tight_layout()
fig = plt.figure(figsize=(15,5))
sns.boxplot(df['Median_Score'])
plt.tight_layout()
df['Median_Score'].describe()
It would appear that most treatment payment amounts for the same DRG within the same state are within 90% to 110% of the median price. This is expected, as doctors should normally be charging similar prices for similar treatments in similar areas. However, we see instances where the payment amount is many times the median.
As we see in the box plot above, in two specific cases the treatment cost over 9 times the median amount.
Let us investigate further.
df[df['Median_Score'] >= 6]
The highest median score is 9.33, which is for:
This individual was charged USD 41,482 for a treatment that had a median amount of just USD 4,441.92. This particular treatment was performed at Provider - STURDY MEMORIAL HOSPITAL.
suspect_hospital1 = df[df['Provider_Name'] == 'STURDY MEMORIAL HOSPITAL']['Median_Score']
fig = plt.figure(figsize=(14,5))
sns.distplot(suspect_hospital1)
plt.tight_layout()
print(" ")
print("Median Score distribution for Provider - STURDY MEMORIAL HOSPITAL is as follows")
print(" ")
print(suspect_hospital1.describe())
The graphical representation is strange. The hospital typically charges less than the median amount for its treatments, as the highest number of observations fall below a median score of 1. This makes the treatment with a median multiple score of over 9 highly unusual.
Let's examine this hospital's track record too.
suspect_hospital2 = df[df['Provider_Name'] == 'ST JOSEPH MEDICAL CENTER']['Median_Score']
fig = plt.figure(figsize=(14,5))
sns.distplot(suspect_hospital2)
plt.tight_layout()
print(" ")
print("Median Score distribution for Provider - ST JOSEPH MEDICAL CENTER is as follows")
print(" ")
print(suspect_hospital2.describe())
Again, this is a hospital that typically charges reasonable prices for treatment; most observations are below a median score of 2, which is fine. But this makes the one treatment that is over 9 times the median price an extreme outlier deserving of extra attention.
Making the broad assumption that around 1% of medical payments are fraudulent, I will tag any medical treatment that paid more than the 99th percentile of median scores in the dataset.
np.percentile(df['Median_Score'], 99)
1% of medical treatments cost more than 1.79 times the median payment for that treatment by state. Treatments that paid more than this shall be flagged.
Feature 12
I will now flag every treatment whose median score is at or above the 99th percentile of median scores in the dataset.
df['Median_Score_Flag'] = df['Median_Score'] >= np.percentile(df['Median_Score'], 99)
df.head(2)
This approach is good for finding providers that overcharge substantially. However, it is not so useful for finding hospitals that overcharge slightly but over the course of many treatments. One way to find these providers is to see which ones have the highest average median score.
Feature 13
To catch hospitals that overcharge slightly but repeatedly, I will average the median score over each provider's treatments; this average becomes the provider's score.
Median_Score_by_Provider = df.groupby(['Provider_Name']).mean()['Median_Score'].reset_index()
print(Median_Score_by_Provider.head(2))
fig = plt.figure(figsize=(10,5))
sns.distplot(Median_Score_by_Provider['Median_Score'])
plt.tight_layout()
print(" ")
print("Median Score distribution for Provider - is as follows")
print(" ")
print(Median_Score_by_Provider.describe())
Some providers consistently charge multiple times the median payment amount. However, before we blow the horn on these hospitals, we have to consider that they may be charging more than their state counterparts because they are high-end/luxury hospitals and/or located in expensive cities. Let's see which hospitals these are.
df = pd.merge(left = df, right = Median_Score_by_Provider, left_on = 'Provider_Name',
right_on = 'Provider_Name', how = 'left')
df.rename(columns={'Median_Score_x':'Median_Score',
'Median_Score_y':'Median_Score_by_Provider'}, inplace=True)
df.head(2)
I had expected these hospitals to either be luxury hospitals or located in expensive cities.
df[df['Median_Score_by_Provider'] >= 2][['Provider_Name', 'Provider_State','Provider_City']].drop_duplicates()
Feature 14
I set the benchmark at a median score of 2 or higher. This avoids situations where the cost of the city impacts the cost of treatment: even in the most expensive cities, the average cost of treatment will never be double the average cost of treatment outside the city within the same state.
df['Provider_Flag_by_Median_Score'] = df['Median_Score_by_Provider'] >= 2
df.head(2)
df[df['Provider_Flag_by_Median_Score']][['Provider_Id','Provider_Name','Provider_City',
'Provider_State','Median_Score_by_Provider']].drop_duplicates()
I will now pick the hospital with the highest score, which is CANCER TREATMENT CENTERS OF AMERICA.
Now, I assume this hospital consistently overcharges for procedures that are much cheaper in other hospitals within the same state. Let's see below.
df[df['Provider_Name'] == 'MEMORIAL HOSPITAL LOS BANOS'][['Median_Score']]
As we can see, this is an effective method to track and find hospitals that overcharge or commit fraud.
Feature 15
I will check the grand total of the number of discharges made by each Provider. The hypothesis is that hospitals with the highest number of discharges are more susceptible to fraud, having false patients and fake claims.
discharges_provider = df.groupby(['Provider_Id','Provider_Name'])['Total_Discharges'].sum().reset_index(name='Grand Total of Discharges')
discharges_provider = discharges_provider.sort_values(by='Grand Total of Discharges', ascending=False)
discharges_provider.head()
It is seen that the highest number of discharges is by 'Florida Hospital', at 25,828.
However, it would be ideal to check the total discharges grouped by Provider State to get better inference.
discharges_provider_state = df.groupby(['Provider_State','Provider_Id', 'Provider_Name'])['Total_Discharges'].sum().reset_index(name='Grand Total of Discharges')
discharges_provider_state = discharges_provider_state.sort_values(by='Grand Total of Discharges', ascending=False)
discharges_provider_state.head()
Feature 16
This ratio will give an idea as to whether the average payments are higher or lower than the median income of the population. The hypothesis is that if the ratio is high, people in that zip code are paying much more than their median income for the treatment, which might be a fraud case, as the patient/hospital might have inflated bills.
Even the same hospital may have a lower ratio for one DRG or state and a higher ratio for another, so I will not group the data; I need an individual ratio for each row entry.
df['Avg_Payment_by_Median_Income'] = df['Average_Total_Payments'] / df['Zip_Median_Income']
df.head(2)
fig, ax = plt.subplots(figsize= (20,5))
sns.boxplot(df["Avg_Payment_by_Median_Income"])
plt.title("Distribution of Average Total Payments by Zipcode Median Income", fontsize=20)
The ones which are over 5 might need some investigation.
Feature 17
Hospitals having a very high ratio are likely to be showing false patients.
Those with higher ratios might need further investigation. Even the same hospital may have a lower ratio for one DRG or state and a higher ratio for another, so I will not group the data; I need an individual ratio for each row entry.
df['Total_Disc_by_Pop'] = df['Total_Discharges'] / df['Zip_Population']
df.head(2)
fig, ax = plt.subplots(figsize= (20,5))
sns.boxplot(df["Total_Disc_by_Pop"])
plt.title("Distribution of Total Discharges by Zipcode Population", fontsize=20)
The ratios over 10 might need investigation.
I will create a new copy of the data to perform the analysis, and call it 'df1'.
# Create a copy of the original dataset
df1 = df.copy()
# Check nulls
df1.isnull().sum().sum()
These NA's are created as a result of merging the Income and the Population data by zip code. There are certain zip codes in our data set which do not have an entry in the zip code dataset.
So, I will replace the NA's with the median of the respective column.
df1 = df1.fillna(df1.median(numeric_only=True))
df1.isnull().sum().sum()
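The median fill only touches numeric columns; on recent pandas versions, `DataFrame.median` on a mixed-dtype frame needs `numeric_only=True`. A minimal sketch on a toy frame (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in frame: one string column, one numeric column with a gap
toy = pd.DataFrame({
    'Provider_State': ['FL', 'TX', 'FL'],
    'Zip_Median_Income': [50000.0, np.nan, 70000.0],
})

# Fill numeric NA's with the column median; string columns are left alone
toy = toy.fillna(toy.median(numeric_only=True))
print(toy['Zip_Median_Income'].tolist())  # [50000.0, 60000.0, 70000.0]
```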
There are a lot of columns which will not be useful in the k-means clustering analysis, most of them categorical. I will drop them.
df1 = df1.drop(columns = ['Provider_Id', 'Provider_Name', 'Provider_Street_Address',
'Provider_City', 'Provider_Zip_Code',
'Hospital_Referral_Region_Description', 'Average_Covered_Charges',
'Region',
'Division', 'Zip_Median_Income',
'Zip_Population', 'Median_Score_Flag',
'Provider_Flag_by_Median_Score'])
df1.head(2)
#cat_variables = ['DRG_Definition', 'Provider_State']
#df = pd.get_dummies(
# df,
# columns = cat_variables,
# drop_first=True
#)
#df.head(2)
Label encoding would be misleading for the clustering, and one-hot encoding creates many new binary features, which is not ideal for k-means clustering.
df1 = df1.drop(columns = ['DRG_Definition', 'Provider_State'])
I will check for multi-collinearity among all the features, and drop one feature from each pair with a very high correlation coefficient.
# All numeric / float variables in the dataset
num_variables_all = ['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments',
'Average_Cost_Per_Procedure', 'Medicare_%_Paid',
'Medicare_%_Paid_State', 'Out_of_Pocket_Payment',
'Out_of_Pocket_per_discharge', 'State_Total', 'Median_Avg_Total_Pyt',
'Median_Score', 'Median_Score_by_Provider',
'Avg_Payment_by_Median_Income',
'Total_Disc_by_Pop']
corr = df1[num_variables_all].corr()
f,ax = plt.subplots(figsize=(15,10))
sns.heatmap(corr, annot=True, cmap='Reds', linewidths=.4, fmt= '.1f', ax=ax)
plt.show()
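A programmatic complement to eyeballing the heatmap is to list the feature pairs whose absolute correlation exceeds a cutoff. A sketch on synthetic columns (the 0.8 cutoff is an assumption for illustration, not the threshold used above):

```python
import numpy as np
import pandas as pd

# Toy data: 'b' is (almost) a linear copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': 2 * a + 0.01 * rng.normal(size=200),
                    'c': rng.normal(size=200)})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once, then filter
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.8]
print(high.index.tolist())  # [('a', 'b')]
```

The same pattern applied to `df1[num_variables_all]` would reproduce the drop list chosen by inspection below.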
# Remove one each from the pair of highly collinear variables
df1 = df1.drop(columns = ['Average_Medicare_Payments', 'Median_Avg_Total_Pyt', 'Out_of_Pocket_per_discharge',
'Average_Cost_Per_Procedure', 'Median_Score_by_Provider',
'Avg_Payment_by_Median_Income'])
# Final numeric variables selected
num_variables = ['Total_Discharges', 'Average_Total_Payments',
'Medicare_%_Paid',
'Medicare_%_Paid_State', 'Out_of_Pocket_Payment',
'State_Total', 'Median_Score',
'Total_Disc_by_Pop']
df1.columns
I will use StandardScaler to standardize the features.
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df1[num_variables])
X = pd.DataFrame(data = X, columns = num_variables)
X.head()
# df1[num_variables] = X
# df1.head(2)
X.isnull().sum().sum()
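A quick sanity check that the scaling worked: after StandardScaler, every column should have mean 0 and unit variance. A sketch on a toy array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])
Z = StandardScaler().fit_transform(toy)

# Each column is centered to mean 0 and scaled to unit variance
print(Z.mean(axis=0).round(6))  # [0. 0.]
print(Z.std(axis=0).round(6))   # [1. 1.]
```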
I will run the clustering for k from 1 to 19 to get a comprehensive view.
from sklearn.cluster import KMeans

# Within-Cluster Sum of Squares (WCSS), as an empty list
wcss = []
# Fit k-means for each k and append the within-cluster sum of
# squares (inertia) to the wcss list
for i in range(1, 20):
    kmeans = KMeans(n_clusters=i, max_iter=300)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,20), wcss, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
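The elbow curve is often ambiguous, so a silhouette-score sweep is a common complementary check for choosing k. A sketch on synthetic blobs, not the assignment data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5,
                       random_state=0)

# Score each candidate k; the silhouette peaks at the best separation
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

Run against the scaled `X`, this would give an independent vote for (or against) the 8-cluster choice below.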
kmeans1 = KMeans(n_clusters = 8, init ='k-means++', max_iter=300, n_init=10,random_state=0)
kmeans1.fit(X)
# Add back the clusters labels to our data set
cluster = kmeans1.labels_
df['cluster'] = kmeans1.labels_
df1['cluster'] = kmeans1.labels_
# Cluster size: number of observations in each cluster
df['cluster'].value_counts()
# Calculate the percentage of the points/rows in each cluster
temp = (df['cluster'].value_counts(normalize=True) * 100).reset_index()
temp.columns = ['cluster', 'Percentage of total']
temp
From the above 8 cluster analysis, the following can be explained:
A. Clusters 5, 3, and 1 have a very small percentage of the total data in their cluster pools. So, there are two possible options for these clusters:
B. Cluster 7 has roughly 10% of the data in its cluster pool. This is the critical cut-off point, as I believe anything near or above 10% of the entire data represents a different but normal cluster group.
C. The other clusters, which have over 10% of the total data in their cluster pools, seem normal and fine. The differences probably lie in the state or the DRG definition, but they are not likely to be a cause for anomalies.
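The small-cluster reasoning in point A can be made explicit in code: clusters holding well under the ~10% cut-off are the candidate anomaly groups. A sketch with toy labels (the 10% threshold mirrors the heuristic above):

```python
import pandas as pd

# Toy cluster labels: cluster 2 holds only 2% of the points
labels = pd.Series([0] * 50 + [1] * 48 + [2] * 2)

# Share of rows per cluster, in percent
share = labels.value_counts(normalize=True) * 100
small_clusters = share[share < 10].index.tolist()
print(small_clusters)  # [2]
```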
Cluster means¶Grouping by cluster and looking at the mean of each feature will help identify anomaly-containing clusters, and prompt further inspection.
display(df.groupby('cluster').mean(numeric_only=True).reset_index()) # feature means, grouped by cluster
The above table gives a comprehensive view of the means, as per the different clusters.
From the features used for clustering, I generated 8 different clusters. The following features show the key differences across the clusters:
Total Discharges
Average Total Payments
Average Medicare Payments
Average Cost per Procedure
Total Discharges (Disc) by Population
Average Payment by Median Income
Out of Pocket Payment
Median_Score_by_Provider
First, I will plot combinations of two features; this will help to understand the clusters and their distribution better.
For visualization, I have used the following code as reference: https://www.kaggle.com/ellecf/visualizing-multidimensional-clusters
sns.lmplot(x='Average_Total_Payments', y='Median_Score',data=df,hue='cluster',fit_reg=False)
sns.lmplot(x='Out_of_Pocket_Payment', y='Median_Score',data=df,hue='cluster',fit_reg=False)
sns.lmplot(x='Medicare_%_Paid', y='Median_Score',data=df,hue='cluster',fit_reg=False)
sns.lmplot(x='Medicare_%_Paid', y='Average_Total_Payments',data=df,hue='cluster',fit_reg=False)
sns.lmplot(x='Out_of_Pocket_Payment', y='Average_Total_Payments',data=df,hue='cluster',fit_reg=False)
sns.lmplot(x='Total_Discharges', y='Average_Total_Payments',data=df,hue='cluster',fit_reg=False)
sns.lmplot(x='Total_Disc_by_Pop', y='Average_Total_Payments',data=df,hue='cluster',fit_reg=False)
sns.lmplot(x='Total_Discharges', y='Total_Disc_by_Pop',data=df,hue='cluster',fit_reg=False)
Above, I plotted pairs of features to examine the clusters and the specific data points which appear to be anomalies. Each data point is color-coded by the cluster to which it was assigned.
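The lmplots show only two features at a time. One complementary view (my addition, not part of the original analysis) is to project all scaled features onto the first two principal components and color by cluster. A sketch on synthetic 8-dimensional data; in the notebook, `X` and `kmeans1.labels_` would take the place of `X_demo` and `labels`:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs as a script
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 8-feature data with 4 groups, clustered then projected to 2-D
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=8,
                       random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)

pts = PCA(n_components=2).fit_transform(X_demo)
plt.scatter(pts[:, 0], pts[:, 1], c=labels, cmap='tab10', s=10)
plt.xlabel('PC1'); plt.ylabel('PC2'); plt.title('Clusters in PCA space')
print(pts.shape)  # (300, 2)
```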
#This is just to add something constant for the strip/swarm plots' X axis. Can be anything you want it to be.
df['Constant'] = "Data"
# create a 3x5 grid of empty figures where we will plot our feature plots. We will have a couple empty ones.
f, axes = plt.subplots(3, 5, figsize=(25, 25), sharex = False)
#Scooch em apart, give em some room
f.subplots_adjust(hspace=0.3, wspace=0.7)
# Step through every column to be plotted; the 3x5 grid is split into rows of 5 via the elif branches
for i in range(0, len(num_variables_all)):
    col = num_variables_all[i]
    if i < 5:
        ax = sns.stripplot(x=df['Constant'], y=df[col].values, hue=df['cluster'], jitter=True, ax=axes[0, i])
        ax.set_title(col)
    elif i < 10:
        ax = sns.stripplot(x=df['Constant'], y=df[col].values, hue=df['cluster'], jitter=True, ax=axes[1, i - 5])  # so if i=6 it is row 1, column 1
        ax.set_title(col)
    elif i < 15:
        ax = sns.stripplot(x=df['Constant'], y=df[col].values, hue=df['cluster'], jitter=True, ax=axes[2, i - 10])
        ax.set_title(col)
The code above generates a series of strip plots. Seaborn plots one data point for each row, and the points are color-coded by the cluster to which they were assigned. Adding jitter fans out the points horizontally. In a strip plot, the points can overlap; in a swarm plot (commented out below), they cannot.
#f, axes = plt.subplots(3, 5, figsize=(25, 25), sharex = False)
#f.subplots_adjust(hspace=0.3, wspace=0.7)
#for i in range(0,len(list(df[num_variables_all]))):
# col = df[num_variables_all].columns[i]
# if i < 5:
# ax = sns.swarmplot(x=df['Constant'],y=df[col].values,hue=df['cluster'],ax=axes[0,(i)])
# ax.set_title(col)
# elif i >= 5 and i<10:
# ax = sns.swarmplot(x=df['Constant'],y=df[col].values,hue=df['cluster'],ax=axes[1,(i-5)])
# ax.set_title(col)
# elif i >= 10 and i<15:
# ax = sns.swarmplot(x=df['Constant'],y=df[col].values,hue=df['cluster'],ax=axes[2,(i-10)])
# ax.set_title(col)
For the entire healthcare fraud and anomaly detection assignment, I created 19 EDA plots displaying the various distributions, relations, and characteristics of the variables. I also created 17 new features.
For the k-means clustering, I used 8 clusters as the optimum number. An important observation was that: